Stratified sampling

In this script we are going to sample a dataset of words, but we'll sample carefully to avoid removing subsets with low occurrence. In other words, we'll perform stratified sampling.

In our case we'll use word length as the feature whose statistics we want to preserve, so we will perform a 1% sampling for each word length. We'll raise the sampling ratio for long words (which are typically rare), so as to preserve them.
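Before diving in, here is a minimal sketch of the PySpark primitive everything below relies on: sampleByKey takes a dict mapping each key to its own sampling fraction. The data here is a made-up toy example, not part of this notebook's dataset:

# Toy sketch: keep every "rare" item, but only about half of the "common" ones
toy = sc.parallelize( [("rare", 1), ("rare", 2), ("common", 3), ("common", 4)] )
fractions = { "rare": 1.0, "common": 0.5 }
toy.sampleByKey( False, fractions ).collect()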

1. Stats extraction


In [1]:
# We start by reading the file from HDFS into an RDD
words = sc.textFile( "hdfs:///user/{0}/data/quijote-words.txt".format(sc.sparkUser()) )
total_words = words.count()

In [2]:
# Now we compute the subclasses. This is done by creating a (k,v) RDD, mapping each word with its length as the key
wordsByLength = words.map( lambda x : (len(x),x) )

In [3]:
# Let's take a peek to see what we've got
wordsByLength.take( 10 )


Out[3]:
[(2, u'El'),
 (9, u'ingenioso'),
 (7, u'hidalgo'),
 (3, u'don'),
 (7, u'Quijote'),
 (2, u'de'),
 (2, u'la'),
 (6, u'Mancha'),
 (4, u'TASA'),
 (2, u'Yo')]

In [4]:
# Ok, check class cardinality
# Compute the number of words having each length
from operator import add
lengthStats = words.map( lambda x : (len(x),1) ).reduceByKey( add ).sortByKey( False )

In [5]:
# Get those figures back to the driver program, since we need to work with them
stats = lengthStats.collect()
stats


Out[5]:
[(21, 1),
 (19, 2),
 (18, 2),
 (17, 8),
 (16, 41),
 (15, 112),
 (14, 339),
 (13, 844),
 (12, 1737),
 (11, 3247),
 (10, 7381),
 (9, 13106),
 (8, 20110),
 (7, 29962),
 (6, 36780),
 (5, 45829),
 (4, 36715),
 (3, 66388),
 (2, 89307),
 (1, 29303),
 (0, 2)]

As we suspected, long words are exceedingly rare, while short words are fairly common (there is a bogus class of length 0, probably containing a couple of empty strings that slipped our processing).
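Incidentally, if we wanted to drop those empty strings before going any further, a simple filter would do. A sketch, not applied in this notebook:

# Hypothetical cleanup step: drop the empty strings
words_clean = words.filter( lambda w : len(w) > 0 )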

Out of curiosity, let's find the champions in word length


In [6]:
wordsByLength.filter( lambda x : x[0]>16 ).sortByKey(False).collect()


Out[6]:
[(21, u'bienintencionadamente'),
 (19, u'estraordinariamente'),
 (19, u'extraordinariamente'),
 (18, u'desembarazadamente'),
 (18, u'correspondi\xe9ndoles'),
 (17, u'estrech\xedsimamente'),
 (17, u'convirti\xe9ndoseles'),
 (17, u'asombradoPregunt\xf3'),
 (17, u'desalumbradamente'),
 (17, u'desagradecimiento'),
 (17, u'desagradecimiento'),
 (17, u'Desagradecimiento'),
 (17, u'desagradecimiento')]

2. Local processing

Ok, now let's manipulate the frequency stats to come up with the sampling fractions we want.

Since this is a training exercise, we'll use NumPy for this (though it's not strictly necessary). So let's import it.


In [7]:
import numpy as np

In [8]:
# Now we convert our collected stats into a NumPy array
statsM = np.array( stats )

In [9]:
# How many words in total?
total = statsM[:,1].sum()
total


Out[9]:
381216

In [10]:
# And what are the probabilities?
from __future__ import division  # ensure we're using floating point division
fraction = statsM[:,1]/total

In [11]:
# Stack the fractions onto our array as a new column
statsM2 = np.c_[statsM,fraction]

In [12]:
# See what we've got. To avoid that ugly scientific notation, we change NumPy presentation options
np.set_printoptions(suppress=True)
statsM2


Out[12]:
array([[    21.        ,      1.        ,      0.00000262],
       [    19.        ,      2.        ,      0.00000525],
       [    18.        ,      2.        ,      0.00000525],
       [    17.        ,      8.        ,      0.00002099],
       [    16.        ,     41.        ,      0.00010755],
       [    15.        ,    112.        ,      0.0002938 ],
       [    14.        ,    339.        ,      0.00088926],
       [    13.        ,    844.        ,      0.00221397],
       [    12.        ,   1737.        ,      0.00455647],
       [    11.        ,   3247.        ,      0.00851748],
       [    10.        ,   7381.        ,      0.01936173],
       [     9.        ,  13106.        ,      0.03437946],
       [     8.        ,  20110.        ,      0.05275225],
       [     7.        ,  29962.        ,      0.07859586],
       [     6.        ,  36780.        ,      0.09648074],
       [     5.        ,  45829.        ,      0.12021793],
       [     4.        ,  36715.        ,      0.09631023],
       [     3.        ,  66388.        ,      0.17414799],
       [     2.        ,  89307.        ,      0.23426876],
       [     1.        ,  29303.        ,      0.07686718],
       [     0.        ,      2.        ,      0.00000525]])

In [13]:
# Ok, as a general rule we want 1% of the data for each category
sample_fractions = np.ones( fraction.shape )*0.01

In [15]:
# but: underrepresented categories (fewer than 100 instances) we sample at 10%
# and very rare categories (fewer than 20 instances) we keep in full
sample_fractions[ statsM2[:,1] < 100 ] = 0.1
sample_fractions[ statsM2[:,1] < 20 ] = 1

In [16]:
# Check those fractions
sample_fractions


Out[16]:
array([ 1.  ,  1.  ,  1.  ,  1.  ,  0.1 ,  0.01,  0.01,  0.01,  0.01,
        0.01,  0.01,  0.01,  0.01,  0.01,  0.01,  0.01,  0.01,  0.01,
        0.01,  0.01,  1.  ])

In [17]:
# Construct the dict "key:fraction" for the stratified sampling
s = dict( zip(map(int,statsM[:,0]),sample_fractions) )
s


Out[17]:
{0: 1.0,
 1: 0.01,
 2: 0.01,
 3: 0.01,
 4: 0.01,
 5: 0.01,
 6: 0.01,
 7: 0.01,
 8: 0.01,
 9: 0.01,
 10: 0.01,
 11: 0.01,
 12: 0.01,
 13: 0.01,
 14: 0.01,
 15: 0.01,
 16: 0.10000000000000001,
 17: 1.0,
 18: 1.0,
 19: 1.0,
 21: 1.0}
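
As noted above, NumPy isn't strictly necessary for this step. A plain-Python sketch that builds the same dict directly from the collected stats, using the same thresholds:

# Plain-Python equivalent of the NumPy pipeline above (sketch)
def pick_fraction( count ):
    if count < 20:    # very rare categories: keep in full
        return 1.0
    if count < 100:   # underrepresented categories: 10%
        return 0.1
    return 0.01       # general rule: 1%

s_alt = { length : pick_fraction(count) for (length, count) in stats }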

3. Sampling

Now that we've got the items grouped by category and the fractions we want to sample for each category, let's do it.


In [18]:
# Sample!
wordsSampledbyLength = wordsByLength.sampleByKey(False,s)

In [19]:
# See how many we've got overall
sampled = wordsSampledbyLength.count()
fraction = sampled/total_words
fraction


Out[19]:
0.009729392260555695
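
Note that sampleByKey performs an independent Bernoulli trial per element, so the realized fractions only approximate the targets (hence roughly, not exactly, 1% overall). If reproducibility mattered, we could fix the optional seed argument. A sketch:

# Sketch: fix the random seed for a reproducible sample
wordsSampledbyLength = wordsByLength.sampleByKey( False, s, seed=42 )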

In [20]:
# Check amount per category
lengthStatsSampled = wordsSampledbyLength.mapValues( lambda x : 1 ).reduceByKey( add ).sortByKey( False )
lengthStatsSampled.collect()


Out[20]:
[(21, 1),
 (19, 2),
 (18, 2),
 (17, 8),
 (16, 3),
 (15, 1),
 (14, 2),
 (13, 8),
 (12, 19),
 (11, 31),
 (10, 72),
 (9, 97),
 (8, 179),
 (7, 293),
 (6, 346),
 (5, 487),
 (4, 368),
 (3, 668),
 (2, 848),
 (1, 272),
 (0, 2)]
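
If we wanted to check how close each category landed to its target, we could join the sampled counts against the originals. A sketch (realized ratios will fluctuate around the values in s):

# Sketch: realized per-category fraction = sampled count / original count
realized = lengthStatsSampled.join( lengthStats ) \
                             .mapValues( lambda c : float(c[0])/c[1] ) \
                             .sortByKey( False )
realized.collect()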

In [21]:
# An example of what we have
wordsSampledbyLength.take(10)


Out[21]:
[(2, u'de'),
 (2, u'un'),
 (5, u'libro'),
 (1, u'y'),
 (6, u'dellos'),
 (11, u'acogimiento'),
 (2, u'de'),
 (3, u'tan'),
 (8, u'engendra'),
 (6, u'alguna')]
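
Finally, if we wanted to keep this sample around for later use, we could drop the length keys and write the words back to HDFS. A sketch, where the output path is an assumption:

# Sketch: persist the sampled words (output path is hypothetical)
wordsSampledbyLength.values().saveAsTextFile(
    "hdfs:///user/{0}/data/quijote-words-sample.txt".format(sc.sparkUser()) )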
